4  Text Analysis

Important

This book is a first draft, and I am actively collecting feedback to shape the final version. If you spot typos, errors in the code, or unclear explanations, please let me know. Your input would be greatly appreciated, and your suggestions will help make this book more accurate, readable, and useful for others. You can reach me at:
Email:
LinkedIn: www.linkedin.com/in/jorammutenge
Datasets: Download all datasets

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

Jamie Zawinski


In the previous chapter, we focused on how dates and numerical values support time series analysis. However, many datasets contain more than numbers tied to timestamps. They often include descriptive attributes or free-form text, with character-based columns holding valuable information. Although Polars is well known for its speed in numerical operations such as aggregation, counting, and averaging, it also provides strong tools for working with text.

This chapter begins with an overview of the types of text-related tasks that Polars handles effectively, along with situations where other libraries may be a better fit. After that, we examine a dataset of YouTube comments. We then walk through code examples that cover analyzing text properties, parsing with Polars, applying transformations, assembling new strings from components, and searching within larger text blocks using techniques like regular expressions.

4.1 Why Text Analysis with Polars?

Large amounts of information are produced every day, and much of it appears as text. Text may range from a single word to long documents. Its sources include human-written descriptions, application-generated logs, customer feedback, support tickets, social media updates, and news articles. This data may be structured with clearly defined columns, semistructured with partial organization that requires cleaning, or unstructured with long passages or BLOBs that need substantial processing. Polars offers a flexible set of functions that work well across these formats and allow you to transform, organize, and analyze text efficiently.

4.1.1 Why Polars Is a Good Choice for Text Analysis

Several reasons make Polars a practical option for text work. One is scale. When a dataset grows too large for Excel to handle efficiently, Polars becomes a much better alternative. For example, I once worked with a team that stored customer feedback in several Excel files. They wanted to merge everything into one dataset and prepare it for a reporting dashboard. Cleaning and combining the files in Excel was slow and frustrating because each adjustment took several seconds to apply. Switching to Polars made the entire workflow faster and much easier to manage.

Another advantage is a reduced risk of human error. Because Polars avoids manual copying or pasting, the original dataset stays intact. All transformations occur in memory, and the source file is unchanged unless you choose to overwrite it. To protect the raw data, it is best to save processed outputs under a new filename so you can return to the original whenever needed.
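
A minimal sketch of this habit, with hypothetical filenames:

import polars as pl

df = pl.read_csv('raw_feedback.csv')   # the original file stays untouched
# ... cleaning transformations go here ...
df.write_csv('feedback_clean.csv')     # processed output saved under a new name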

Polars is also well suited for tasks that require quantification. You might want to count how many product reviews mention a phrase such as “slow shipping” or categorize survey responses based on recurring themes. Polars makes it straightforward to clean and structure text fields. Cleaning may involve trimming extra spaces, fixing capitalization, or normalizing spelling. Structuring may involve generating new columns by extracting keywords or combining fragments from different fields. The string functions in Polars can be chained to support complex manipulations with relative ease.

Text analysis in Polars is rule-driven. The computer follows explicit instructions that you write in code. This approach differs from machine learning in which the system adapts to patterns in the data. Rules have the advantage of being transparent and easy to verify. However, they can become difficult to manage as exceptions accumulate. A short when-then-otherwise expression can expand to many lines as the dataset evolves. Maintaining these rules often requires close collaboration with developers.

Polars is especially effective when you know exactly what you want to extract. Its functions, including regular expressions, help you locate, isolate, or replace precise text patterns. A question such as “How many reviews mention ‘slow shipping’?” is well suited to Polars. A broader question such as “What makes these customers dissatisfied?” is far more difficult to answer with rules alone.
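
As a minimal sketch of the first kind of question, with a hypothetical reviews dataframe:

import polars as pl

reviews = pl.DataFrame({'Review': [
    'Slow shipping ruined the experience',
    'Great product, fast delivery',
    'Again slow shipping, two weeks!',
]})

# Count reviews that mention the phrase, ignoring case
(reviews
 .select(Mentions=pl.col('Review').str.to_lowercase()
         .str.contains('slow shipping').sum())
 )  # returns 2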

4.1.2 When Polars Is Not a Good Choice

Despite its strengths, Polars is not the right tool for every text analysis task.

The first situation involves tasks that depend on human judgment. If the dataset is small or recently collected, manual labeling may be faster and more insightful. When the goal is to review all entries and prepare a qualitative summary, human interpretation often provides richer results than automated processing.

The second situation concerns rapid search and retrieval. Systems such as Apache Solr and Graylog are designed for efficient indexing and querying of text. Because Polars is columnar, it may not deliver the performance needed for low-latency searches.

The third situation relates to broader natural language processing tasks. Libraries such as spaCy or nltk are more appropriate when you need sentiment analysis, part-of-speech tagging, or language generation. Polars can support simple keyword-based sentiment rules, but the variety of expressions and the complexity of negation make rule-based methods impractical. It can also concatenate strings, but it cannot learn from data or generate new text in the way machine learning models can.

With these strengths and limitations in mind, we can now move on to the dataset that will serve as the foundation for the Polars examples in the rest of this chapter.

4.2 The YouTube Comments Dataset

In this chapter, we work with a dataset of YouTube comments collected from videos created by several well-known data-focused YouTubers: Avery Smith, Ken Jee, Luke Barousse, Agatha, Tom Gadsby, and Stefanovic. The selected videos are those in which each creator shared their perspective on the question “How would I learn data analysis?” The dataset contains close to 5,000 comments, which represent feedback from viewers who chose to respond to these videos.

The raw files used to build the youtube_comments dataframe are stored in JSON format. Polars can read JSON along with many other file types. Although the dataset includes 11 columns, we will focus on a smaller subset. We also introduce a new column, Source, which records the filename of each JSON file so we can identify the video associated with every comment.

import polars as pl
from pathlib import Path

files = Path('data').glob('*.json')

dfs = []
for file in files:
1    df = (pl.read_ndjson(file)
     .with_columns(source=pl.lit(file.name))
     .rename(lambda c: c.title())
    )
    dfs.append(df)

comments_df = pl.concat(dfs)
1
The files use a newline-delimited JSON format. This is why we read them with read_ndjson.

Next, we construct the youtube_comments dataframe by selecting the columns needed for our analysis. One transformation converts Time_Parsed from an epoch timestamp into a proper datetime column, which we rename to Date for clarity.

youtube_comments = (comments_df
 .select(pl.from_epoch('Time_Parsed').alias('Date'),
         'Time','Text','Author','Source')
 )
youtube_comments
shape: (4_839, 5)
Date Time Text Author Source
datetime[μs] str str str str
2022-12-11 16:04:44 "3 years ago (edited)" "It took me 7 years of working … "@Stefanovic92" "stefanovic.json"
2025-06-11 15:04:44 "6 months ago" "1:50 Fiat Multipla is indeed t… "@sasker1000" "stefanovic.json"
2025-06-11 15:04:44 "6 months ago" "Bro has beautiful eyes for no … "@cpblair" "stefanovic.json"
2025-06-11 15:04:44 "6 months ago (edited)" "Man, I did it. I actually list… "@cometamítico" "stefanovic.json"
2025-03-11 15:04:44 "9 months ago" "I think I clicked on this vide… "@Hervoicehealed" "stefanovic.json"
2025-04-11 14:49:31 "8 months ago" "Like with women, the numbers g… "@kknight1990" "avery_smith.json"
2024-12-11 15:49:31 "1 year ago" "I can answer that if you want … "@vvs1998" "avery_smith.json"
2024-12-11 15:49:31 "1 year ago (edited)" " @vvs1998 Tableau Public is fr… "@Nikopup" "avery_smith.json"
2025-01-11 15:49:31 "11 months ago (edited)" "​Thanks ​ @vvs1998 for your ho… "@adedapomichaelowoeye9353" "avery_smith.json"
2025-01-11 15:49:31 "11 months ago" "Thanks" "@Esss8499" "avery_smith.json"

Throughout this chapter, I show how to clean and prepare these columns and how to run several analyses on them. In a production workflow, it would make sense to design an ETL pipeline that processes the data in a consistent way and saves the structured output into a new dataset. For our purposes, we will continue working directly with the raw data.

Let’s begin by using Polars to examine and describe the text contained in the comments.

4.3 Text Characteristics

In Polars, textual data lives in the String type, the most flexible of the data types because nearly any value can be represented as text. Composite types such as List and Struct can hold strings as well. As with most datasets, an early step in the workflow is to profile the text so we understand its properties. This helps determine what cleaning and parsing steps will be necessary later.

A simple way to start exploring the text is to measure the number of characters in each entry using the len_chars function. This function counts characters in every cell of a column. Another related function, len_bytes, calculates the number of bytes used to store the text. This distinction becomes important when working with emojis or non-Latin scripts, since a single emoji or Chinese character may occupy multiple bytes. The dataframe below contains standard text as well as an emoji.

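Here is a minimal sketch of how such a dataframe can be built, assuming a single column named Text:

df = pl.DataFrame({'Text': ['👍', 'Sample comment']})
df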
shape: (2, 1)
Text
str
"👍"
"Sample comment"

Now let’s write Polars code that computes both character counts and byte counts for each row.

(df
 .with_columns(Chars=pl.col('Text').str.len_chars(),
               Bytes=pl.col('Text').str.len_bytes())
 )
shape: (2, 3)
Text Chars Bytes
str u32 u32
"👍" 1 4
"Sample comment" 14 14
Note

For plain ASCII text, the character count and byte count are identical. The emoji example shows that a single character can require 4 bytes.

By plotting the distribution of comment lengths, we can see the typical size of comments and identify unusually long entries that may need special handling.

to_plot = (youtube_comments
 .with_columns(Length=pl.col('Text').str.len_chars())
 .group_by('Length')
 .agg(Records=pl.len())
 .sort('Length')
)
to_plot
shape: (573, 2)
Length Records
u32 u32
1 23
2 10
3 11
4 20
5 16
1830 1
2555 1
2858 2
3616 1
8251 1

The chart below shows that most comments fall between 1 and 500 characters, and only a small number exceed 1,000 characters. The dataset spans very short comments of a single character up to lengthy ones of around 8,000 characters. This wide range suggests substantial variation in the text column before any deeper profiling.

import hvplot.polars

(to_plot
 .hvplot.bar(x='Length', y='Records', height=400,
             title='Distribution of comment lengths')
 )

Let’s examine a few example entries from the Time column. If this were a spreadsheet, you might scroll through dozens of rows to get a sense of the data. These samples, however, are enough to understand what the column contains:

print(youtube_comments.get_column('Time')[0])
print(youtube_comments.get_column('Time')[1])
3 years ago (edited)
6 months ago

The values in this column can be described as semistructured information. They are not immediately suitable for analysis, but they do contain recognizable components that follow a consistent pattern. Each entry typically begins with a number and contains the word “ago.” In some cases, the word “edited” appears, indicating that the comment was modified after it was originally posted.

To make this column more useful, we will parse it into several new columns, with each one capturing a single attribute. The overall workflow is as follows:

  • Decide which output columns are required
  • Use parsing functions to separate the data
  • Apply transformations, such as converting data types
  • Review the results across the full dataset, since some rows may not match the expected format
  • Repeat these steps until the information is structured correctly

From the Time column, we will derive the columns Digit, Duration, Ago, and Edit. In the next section, we will introduce parsing functions and begin restructuring the YouTube comments dataset.

4.4 Text Parsing

Parsing data with Polars involves extracting specific portions of a text value to make the data more useful for analysis. In practice, parsing separates the part of the text we care about from everything else; the resulting code returns only the desired portion.

The simplest parsing functions return a fixed number of characters from either the beginning or the end of a string. The slice function returns characters starting at the specified offset position and continuing to the rightmost character in the text. When the optional length parameter is provided, the function returns exactly the number of characters specified by that value. For example, setting length=3 returns three characters. Using a negative value for offset changes the starting position to count from the right side of the string. Therefore, given the text “Joram” and offset=-3, the returned value is “ram”.

Note

Both the offset and length parameters are defined in terms of the number of characters in a UTF-8 string. A character is defined as a Unicode scalar value. A single character is represented by one byte when working with ASCII text, and by up to four bytes otherwise.
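
As a quick sketch of these rules on a literal value before applying them to the dataset:

(pl.select(Name=pl.lit('Joram'))
 .with_columns(First_Three=pl.col('Name').str.slice(0, length=3),  # "Jor"
               Last_Three=pl.col('Name').str.slice(-3))            # "ram"
 )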

In the youtube_comments dataframe, we can parse out the word “ago” from the Time column using the slice function.

(youtube_comments
 .select(pl.col('Time').str.slice(-3))
 )
shape: (4_839, 1)
Time
str
"ed)"
"ago"
"ago"
"ed)"
"ago"
"ago"
"ago"
"ed)"
"ed)"
"ago"

However, this approach only works for values that end with the word “ago”. We could try adjusting the offset value, but that still would not work for every row. The slice function is most effective when the text values follow a consistent pattern. To handle all rows correctly, we can introduce an additional function, split. This function divides the text based on a specified delimiter, such as a space, a single character, or an entire word, and converts the column’s data type to List. The delimiter itself is not included in the returned text.

(youtube_comments
 .select(pl.col('Time').str.split(' (edited)').list.get(0))
 .with_columns(pl.col('Time').str.slice(-3))
 )
shape: (4_839, 1)
Time
str
"ago"
"ago"
"ago"
"ago"
"ago"
"ago"
"ago"
"ago"
"ago"
"ago"

We can confirm that all records end with the word “ago” by using group_by to count the occurrences of each value and then verifying that the count for “ago” matches the total number of rows in the dataframe. In this case, the counts confirm that this is true.

(youtube_comments
 .select(pl.col('Time').str.split(' (edited)').list.get(0))
 .with_columns(pl.col('Time').str.slice(-3))
 .group_by('Time').len()
 )
shape: (1, 2)
Time len
str u32
"ago" 4839

Having demonstrated how to use the slice and split functions, we can now create four new columns, Digit, Duration, Ago, and Edit, from the Time column. Each column will be populated with the appropriate value extracted from the original text.

(youtube_comments
 .select('Time')
 .with_columns(pl.col('Time').str.split(' '))
 .with_columns(Digit=pl.col('Time').list[0].cast(pl.Int8),
1               Duration=pl.col('Time').list.get(1),
               Ago=pl.col('Time').list.get(2),
2               Edit=pl.col('Time').list.get(3, null_on_oob=True))
 )
1
Elements in a list can be accessed using the get expression instead of index-based selection such as list[1].
2
Rows corresponding to unedited comments contain fewer elements in their list. As a result, attempting to retrieve the fourth element with get(3) would be out of bounds and cause the operation to fail. Setting the parameter null_on_oob=True inserts a null value for those rows instead.
shape: (4_839, 5)
Time Digit Duration Ago Edit
list[str] i8 str str str
["3", "years", … "(edited)"] 3 "years" "ago" "(edited)"
["6", "months", "ago"] 6 "months" "ago" null
["6", "months", "ago"] 6 "months" "ago" null
["6", "months", … "(edited)"] 6 "months" "ago" "(edited)"
["9", "months", "ago"] 9 "months" "ago" null
["8", "months", "ago"] 8 "months" "ago" null
["1", "year", "ago"] 1 "year" "ago" null
["1", "year", … "(edited)"] 1 "year" "ago" "(edited)"
["11", "months", … "(edited)"] 11 "months" "ago" "(edited)"
["11", "months", "ago"] 11 "months" "ago" null

After parsing the column with Polars, the data is now organized into a clearer and more practical structure. Before concluding, however, we can make a few additional adjustments to further refine the dataset. In the next section, we will explore functions designed specifically for transforming strings.

4.5 Text Transformations

Transformations modify string values in some way. In Chapter 3, we explored several date and datetime transformation functions. Polars also provides a dedicated set of functions for working with string values. These functions are especially useful when working with parsed data, as well as any text data that must be cleaned or standardized before analysis.

One of the most common string transformations involves changing capitalization. The to_uppercase function converts all letters to uppercase, while the to_lowercase function converts all letters to lowercase. For example:

(youtube_comments
 .select(Original=pl.lit('Deep Analysis with Polars'))
 .with_columns(Upper=pl.col('Original').str.to_uppercase(),
               Lower=pl.col('Original').str.to_lowercase())
 )
shape: (1, 3)
Original Upper Lower
str str str
"Deep Analysis with Polars" "DEEP ANALYSIS WITH POLARS" "deep analysis with polars"

These transformations are helpful for standardizing values that may have been entered inconsistently. A human reader easily recognizes that “Python,” “pyThoN,” and “PYTHON” all refer to the same programming language. Polars, however, treats them as distinct values. If we counted comments by programming language using these strings as-is, we would end up with three separate records for Python, leading to incorrect conclusions. Converting all values to uppercase or lowercase resolves this issue.
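
A minimal sketch of the problem and the fix, using a small hypothetical dataframe:

langs = pl.DataFrame({'Language': ['Python', 'pyThoN', 'PYTHON']})

langs.group_by('Language').len()                             # three groups, each with a count of 1
langs.group_by(pl.col('Language').str.to_lowercase()).len()  # one group, "python", with a count of 3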

Polars also provides the to_titlecase function, which capitalizes the first letter of each word in a string. This is particularly useful for proper nouns, such as names of programming languages or people:

(youtube_comments
 .select(Language=pl.lit('cLoJuRe'),
         Creator=pl.lit('rich hickey'))
 .with_columns(Title_Language=pl.col('Language').str.to_titlecase(),
               Title_Creator=pl.col('Creator').str.to_titlecase())
 )
shape: (1, 4)
Language Creator Title_Language Title_Creator
str str str str
"cLoJuRe" "rich hickey" "Clojure" "Rich Hickey"

Suppose we filter rows that mention the programming language Python in the Text column. We will find several variations of the string, including values written entirely in uppercase such as “PYTHON”. To clean and standardize these values so that all mentions use only an initial capital letter, we can apply the to_titlecase function as part of the transformation pipeline:

(youtube_comments
 .select('Text')
 .with_columns(pl.col('Text').str.split(' '))
 .with_columns(Language=pl.col('Text').list.eval(
1     pl.element().filter(pl.element().str.to_lowercase() == 'python')))
2 .filter(pl.col('Language').list.len() > 0)
3 .with_columns(pl.col('Language').list.unique().list.get(0))
 .select(pl.col('Language'),
         Language_Clean=pl.col('Language').str.to_titlecase())
 .unique('Language')
 )
1
Returns all rows that mention Python, regardless of the string case used.
2
Filters out rows that do not mention Python.
3
Ensures that all mentions of Python with the same string case are counted as a single value in the list. If the resulting list contains more than one element, only the first is retained.
shape: (3, 2)
Language Language_Clean
str str
"Python" "Python"
"python" "Python"
"PYTHON" "Python"

Another useful transformation method is strip_chars, which removes whitespace from the beginning and end of a string. Extra whitespace commonly appears when values are extracted from longer text, entered manually, or copied between applications. For example, a string such as “Clojure” padded with spaces can be cleaned using strip_chars:

(youtube_comments
 .select(Language=pl.lit('   Clojure   '))
 .with_columns(Language_Stripped=pl.col('Language').str.strip_chars())
 )
shape: (1, 2)
Language Language_Stripped
str str
"   Clojure   " "Clojure"

If you only need to remove spaces from the start of a string, use strip_chars_start. To remove spaces from the end, use strip_chars_end:

(youtube_comments
 .select(Language=pl.lit('   Clojure   '))
 .with_columns(Strip_Start=pl.col('Language').str.strip_chars_start(),
               Strip_End=pl.col('Language').str.strip_chars_end())
 )
shape: (1, 3)
Language Strip_Start Strip_End
str str str
"   Clojure   " "Clojure   " "   Clojure"

Polars also includes functions for removing specific text fragments from either end of a string. These are strip_prefix and strip_suffix. For instance, the Source column contains values that end with “.json”. That suffix can be removed using strip_suffix:

(youtube_comments
 .with_columns(pl.col('Source').str.strip_suffix('.json'))
 )
shape: (4_839, 5)
Date Time Text Author Source
datetime[μs] str str str str
2022-12-11 16:04:44 "3 years ago (edited)" "It took me 7 years of working … "@Stefanovic92" "stefanovic"
2025-06-11 15:04:44 "6 months ago" "1:50 Fiat Multipla is indeed t… "@sasker1000" "stefanovic"
2025-06-11 15:04:44 "6 months ago" "Bro has beautiful eyes for no … "@cpblair" "stefanovic"
2025-06-11 15:04:44 "6 months ago (edited)" "Man, I did it. I actually list… "@cometamítico" "stefanovic"
2025-03-11 15:04:44 "9 months ago" "I think I clicked on this vide… "@Hervoicehealed" "stefanovic"
2025-04-11 14:49:31 "8 months ago" "Like with women, the numbers g… "@kknight1990" "avery_smith"
2024-12-11 15:49:31 "1 year ago" "I can answer that if you want … "@vvs1998" "avery_smith"
2024-12-11 15:49:31 "1 year ago (edited)" " @vvs1998 Tableau Public is fr… "@Nikopup" "avery_smith"
2025-01-11 15:49:31 "11 months ago (edited)" "​Thanks ​ @vvs1998 for your ho… "@adedapomichaelowoeye9353" "avery_smith"
2025-01-11 15:49:31 "11 months ago" "Thanks" "@Esss8499" "avery_smith"

Another transformation worth highlighting is the replace function. This function is helpful when a word, symbol, or substring within a column needs to be changed or removed. At its simplest, replace takes two arguments: the text to search for and the replacement text. Continuing the cleanup of the Source column, we can replace underscores with a space and then apply to_titlecase to format the text consistently:

youtube_comments_clean = (youtube_comments
 .with_columns(pl.col('Source').str.strip_suffix('.json'))
 .with_columns(pl.col('Source').str.replace('_',' ').str.to_titlecase())
 )
youtube_comments_clean
shape: (4_839, 5)
Date Time Text Author Source
datetime[μs] str str str str
2022-12-11 16:04:44 "3 years ago (edited)" "It took me 7 years of working … "@Stefanovic92" "Stefanovic"
2025-06-11 15:04:44 "6 months ago" "1:50 Fiat Multipla is indeed t… "@sasker1000" "Stefanovic"
2025-06-11 15:04:44 "6 months ago" "Bro has beautiful eyes for no … "@cpblair" "Stefanovic"
2025-06-11 15:04:44 "6 months ago (edited)" "Man, I did it. I actually list… "@cometamítico" "Stefanovic"
2025-03-11 15:04:44 "9 months ago" "I think I clicked on this vide… "@Hervoicehealed" "Stefanovic"
2025-04-11 14:49:31 "8 months ago" "Like with women, the numbers g… "@kknight1990" "Avery Smith"
2024-12-11 15:49:31 "1 year ago" "I can answer that if you want … "@vvs1998" "Avery Smith"
2024-12-11 15:49:31 "1 year ago (edited)" " @vvs1998 Tableau Public is fr… "@Nikopup" "Avery Smith"
2025-01-11 15:49:31 "11 months ago (edited)" "​Thanks ​ @vvs1998 for your ho… "@adedapomichaelowoeye9353" "Avery Smith"
2025-01-11 15:49:31 "11 months ago" "Thanks" "@Esss8499" "Avery Smith"
Tip

If a value in Source contains multiple underscores, use replace_all, since replace only modifies the first match.
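
A sketch of the difference on a hypothetical value with two underscores:

(pl.select(Source=pl.lit('luke_barousse_data'))
 .with_columns(Replace=pl.col('Source').str.replace('_', ' '),
               Replace_All=pl.col('Source').str.replace_all('_', ' '))
 )
# Replace:     "luke barousse_data"
# Replace_All: "luke barousse data"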

The chart below displays the number of comments grouped by source. Ken Jee’s video clearly dominates, with the highest comment count, followed by Agatha and Tom Gadsby.

(youtube_comments_clean
 .group_by('Source').len()
 .sort('len')
 .hvplot.barh(x='Source', y='len', height=400,
              xlabel='', ylabel='',
              title='Distribution of comments by video source')
 .opts(fontsize={'xticks': 11, 'yticks': 11})
 )

So far, we have seen how parsing and transformation functions can clean structured text data and make it more suitable for analysis. In the next section, we will focus on the Text column and explore how Polars can be used to identify specific patterns or elements within free-form text.

4.6 Finding Elements Within Larger Blocks of Text

Working with text often requires parsing and transforming it so that it can be analyzed more effectively. Another common task is searching for specific substrings within larger blocks of text. This approach is useful for narrowing results, classifying records, or replacing matched strings with alternative values.

4.6.1 Wildcard Matches: contains

Polars provides several tools for identifying patterns in text. The contains function checks whether a given pattern appears anywhere in a string. The starts_with function verifies whether the text begins with a specified pattern, while ends_with confirms whether it concludes with one. Because these functions return a boolean mask, they are often used in combination with the filter method. For example, we can count how many comments reference the programming language SQL:

(youtube_comments_clean
 .filter(pl.col('Text').str.contains('SQL'))
 .select(Count=pl.col('Text').n_unique())
 )
shape: (1, 1)
Count
u32
125
Note

By default, n_unique treats null as a distinct value. In this example, null values are excluded because the filter operation is applied first.

This result shows that more than one hundred comments mention “SQL.” However, this approach only matches the uppercase string. What happens if some commenters write “Sql” or use the lowercase form “sql” instead? There are two common ways to solve this.

One option is to transform the column being searched using either the to_uppercase or to_lowercase function discussed in the previous section. By converting all text to a consistent case, the search effectively becomes case insensitive:

(youtube_comments_clean
 .filter(pl.col('Text').str.to_uppercase().str.contains('SQL'))
 .select(Count=pl.col('Text').n_unique())
 )
shape: (1, 1)
Count
u32
178

Another approach is to list all expected variations of the text and separate them with the pipe operator (|), which represents a logical or:

(youtube_comments_clean
 .filter(pl.col('Text').str.contains('SQL|sql|Sql'))
 .select(Count=pl.col('Text').n_unique())
 )
shape: (1, 1)
Count
u32
178

This method quickly becomes impractical when many variations are possible. Enumerating all cases is tedious and prone to mistakes. A more robust solution is to prefix the pattern with the case-insensitive flag (?i):

(youtube_comments_clean
 .filter(pl.col('Text').str.contains('(?i)sql'))
 .select(Count=pl.col('Text').n_unique())
 )
shape: (1, 1)
Count
u32
178

Any of these pattern matching functions, including contains, starts_with, and ends_with, can also be negated using the tilde symbol (~). For example, to find records that do not mention “sql,” you can write:

(youtube_comments_clean
 .filter(~pl.col('Text').str.contains('(?i)sql'))
 .select(Count=pl.col('Text').n_unique())
 )
shape: (1, 1)
Count
u32
4452

Filtering on multiple strings can be done using | for the or operator and & for the and operator. For example, we can count the number of comments that mention “SQL” or “Python” (case sensitive):

(youtube_comments_clean
 .filter(pl.col('Text').str.contains('SQL') | pl.col('Text').str.contains('Python'))
 .select(Count=pl.col('Text').n_unique())
 )
shape: (1, 1)
Count
u32
189

A more concise version of the same code uses the regex alternation operator within a single pattern:

(youtube_comments_clean
 .filter(pl.col('Text').str.contains('SQL|Python'))
 .select(Count=pl.col('Text').n_unique())
 )
shape: (1, 1)
Count
u32
189

We can also return only comments that mention both “SQL” and “Python”:

(youtube_comments_clean
 .filter(pl.col('Text').str.contains('SQL') & pl.col('Text').str.contains('Python'))
 .select(Count=pl.col('Text').n_unique())
 )
shape: (1, 1)
Count
u32
41
Tip

You can replace & with a comma , when performing and operations.
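
For example, the and filter shown earlier can be rewritten with a comma; a sketch equivalent to the & version:

(youtube_comments_clean
 .filter(pl.col('Text').str.contains('SQL'),
         pl.col('Text').str.contains('Python'))
 .select(Count=pl.col('Text').n_unique())
 )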

It is important to use parentheses to control the order of operations when combining | and &. Without them, you may get unexpected results. For instance, the two code versions below do not return the same results because & is evaluated before |:

Version 1

(youtube_comments_clean
 .filter(pl.col('Text').str.contains('SQL') | pl.col('Text').str.contains('Python') &
         pl.col('Text').str.contains('excel'))
 .select(Count=pl.col('Text').n_unique())
 )
shape: (1, 1)
Count
u32
130

Version 2

(youtube_comments_clean
 .filter((pl.col('Text').str.contains('SQL') | pl.col('Text').str.contains('Python')) & 
         pl.col('Text').str.contains('excel'))
 .select(Count=pl.col('Text').n_unique())
 )
shape: (1, 1)
Count
u32
18

The contains function is also useful within a when-then-otherwise expression for classifying or summarizing rows. Let’s start with classification. By using contains inside a when-then-otherwise expression, we can assign labels and groups to records. For example, some comments reference technologies such as Tableau or Power BI. We can identify how many entries mention these technologies by applying contains in this way:

(youtube_comments_clean
 .with_columns(pl.col('Text').str.to_lowercase())
 .with_columns(
     Technology=pl.when(pl.col('Text').str.contains('power bi|powerbi'))
         .then(pl.lit('Power BI'))
         .when(pl.col('Text').str.contains('python'))
         .then(pl.lit('Python'))
         .when(pl.col('Text').str.contains('sql'))
         .then(pl.lit('SQL'))
         .when(pl.col('Text').str.contains('tableau'))
         .then(pl.lit('Tableau'))
         .when(pl.col('Text').str.contains('excel'))
         .then(pl.lit('Excel'))
         .otherwise(None))
 .group_by('Technology')
 .agg(Count=pl.len())
 .sort('Count', descending=True)
)
shape: (6, 2)
Technology Count
str u32
null 4360
"Python" 246
"SQL" 81
"Power BI" 75
"Excel" 65
"Tableau" 12

Python appears most frequently, while SQL and Excel are mentioned far less often. This outcome is not surprising, since Python has become the standard language for many people working in data-related fields.

It is important to note that this when-then-otherwise approach assigns only one label per comment, evaluating each condition in sequence. As a result, if a comment includes both “Python” and “SQL,” it will be categorized as “Python.” This approach works well in many situations. However, when analyzing longer text sources such as customer reviews, survey responses, YouTube comments, or product feedback, the ability to tag records with multiple categories becomes more valuable. In those cases, boolean masks provide a better solution.

Earlier, we saw that contains can return a simple true or false value, which can be used to flag rows. Because some comments mention more than one technology, it can be helpful to create separate columns that indicate whether each technology is present:

(youtube_comments_clean
 .with_columns(pl.col('Text').str.to_lowercase())
 .with_columns(
     Power_BI=pl.col('Text').str.contains('power bi|powerbi'),
     Python=pl.col('Text').str.contains('python'),
     SQL=pl.col('Text').str.contains('sql'),
     Tableau=pl.col('Text').str.contains('tableau'),
     Excel=pl.col('Text').str.contains('excel'))
 .group_by('Power_BI', 'Python', 'SQL', 'Tableau','Excel')
 .agg(Count=pl.len())
 .sort('Count', descending=True)
)
shape: (28, 6)
Power_BI Python SQL Tableau Excel Count
bool bool bool bool bool u32
false false false false false 4360
false true false false false 174
false false false false true 65
false false true false false 51
false true true false false 33
false true false true false 3
true false true false false 3
true true false false true 2
false false false true true 1
true true true true false 1

This produces a boolean matrix that shows how often different combinations of technologies occur. It also highlights cases where one tool is mentioned without any of the others.

Expand Your Knowledge

Each distinct combination of the five boolean flags receives its own count. Examples include:

  • Power_BI=False, Python=False, SQL=False, Tableau=False, Excel=False: number of comments with none of these technologies
  • Power_BI=True, Python=False, SQL=False, Tableau=False, Excel=False: number of comments mentioning only “Power BI”
  • Power_BI=True, Python=True, SQL=False, Tableau=False, Excel=False: number of comments mentioning both “Power BI” and “Python”
  • etc.

Because there are five boolean columns, the maximum number of possible combinations is 2⁵ = 32, although not all combinations may appear in practice.

This approach differs from the earlier when-then-otherwise method, where each record was limited to a single label. Here, multiple True values can exist for the same record, and the group_by operation generates counts for every unique True or False pattern across the five columns.

These combinations can be especially useful when preparing datasets for others to explore or when working in BI or visualization platforms. In some situations, however, it is more practical to condense the data further and apply aggregations to rows that match a given string pattern. In this example, we count records, but other operations such as sum or mean could be applied if the dataset includes numeric columns, such as revenue totals.

to_plot = (youtube_comments_clean
 .with_columns(pl.col('Text').str.to_lowercase())
 .select(Power_BI=(pl.col('Text').str.contains('power bi|powerbi')).sum(),
         Python=(pl.col('Text').str.contains('python')).sum(),
         SQL=(pl.col('Text').str.contains('sql')).sum(),
         Tableau=(pl.col('Text').str.contains('tableau')).sum(),
         Excel=(pl.col('Text').str.contains('excel')).sum())
)
to_plot
shape: (1, 5)
Power_BI Python SQL Tableau Excel
u32 u32 u32 u32 u32
75 269 178 79 154

We now have a cleaner overview showing how often each technology appears in the comments, and it’s clear that “Python” shows up more frequently than the others. The visualization below illustrates these results.

(to_plot
 .with_row_index()
 .rename(dict(Power_BI='Power BI'))
 .unpivot(index='index', variable_name='Technology', value_name='Count')
 .drop('index')
 .sort('Count', descending=False)
 .hvplot.barh(x='Technology', y='Count', 
              xlabel='', ylabel='', height=400,
              title='Frequency of technologies mentioned in comments')
 .opts(fontsize={'xticks': 11, 'yticks': 11})
 )

In the code above, a single record that mentions multiple technologies is counted once for each technology it contains. What we lose, however, is the ability to see the exact combinations of technologies that appear together in the same record.

String functions like contains, starts_with, and ends_with are versatile and can be applied throughout Polars workflows to filter, classify, or summarize data depending on the output you need. When combined with the text-processing and transformation methods introduced earlier, they provide even greater flexibility. In the next section, I’ll move on to handling cases where matches need to be exact, before expanding the discussion to include more advanced techniques with regular expressions.

4.6.2 Exact Matches: is_in

Before moving on to more advanced pattern matching with regular expressions, it is helpful to pause and examine another function: is_in. Although it is not strictly a pattern matching function, is_in allows you to check values against a predefined list, which often leads to cleaner and more concise code.

Suppose we want to classify comments based on their first word. We can extract this word by applying the split function with a space as the separator. Many comments begin with words such as:

  • Thanks
  • Great
  • How
  • Can
  • This
  • What

One way to isolate these records is to stack together dataframes filtered by each word:

comments_1st_word_df = (youtube_comments_clean
 .with_columns(First_Word=pl.col('Text').str.split(' ').list.first())
 .select('First_Word','Text')
)

(comments_1st_word_df
 .filter(First_Word='Thanks')
 .vstack(comments_1st_word_df
         .filter(First_Word='Great'))
 .vstack(comments_1st_word_df
         .filter(First_Word='How'))
 .vstack(comments_1st_word_df
         .filter(First_Word='Can'))
 .vstack(comments_1st_word_df
         .filter(First_Word='This'))
 .vstack(comments_1st_word_df
         .filter(First_Word='What'))
 )
shape: (667, 2)
First_Word Text
str str
"Thanks" "Thanks for the video I'm looki…
"Thanks" "Thanks for this content man...…
"Thanks" "Thanks man!"
"Thanks" "Thanks you for the video"
"Thanks" "Thanks but I don’t want to be …
"What" "What if my degree is a healthc…
"What" "What about statistics and math…
"What" "What projects should I do if I…
"What" "What about supply chain?"
"What" "What if I absolutely hate exce…

To reduce repetition, we could instead loop over the list of words and build the combined dataframe programmatically:

word_dfs = []
for word in ['Thanks','Great','How','Can','This','What']:
    word_df = (comments_1st_word_df
     .filter(First_Word=word)
    )
    word_dfs.append(word_df)
pl.concat(word_dfs)
shape: (667, 2)
First_Word Text
str str
"Thanks" "Thanks for the video I'm looki…
"Thanks" "Thanks for this content man...…
"Thanks" "Thanks man!"
"Thanks" "Thanks you for the video"
"Thanks" "Thanks but I don’t want to be …
"What" "What if my degree is a healthc…
"What" "What about statistics and math…
"What" "What projects should I do if I…
"What" "What about supply chain?"
"What" "What if I absolutely hate exce…

Although both approaches work, they are not the most concise ways to write Polars code. The is_in function provides a simpler and less error-prone alternative. Just be sure that the values you pass match the column’s data type. Use numbers for numeric columns and quoted strings for text columns.

to_plot = (youtube_comments_clean
 .with_columns(First_Word=pl.col('Text').str.split(' ').list.first())
 .filter(pl.col('First_Word').is_in(['Thanks','Great','How','Can','This','What']))
 .select('First_Word','Text')
)
to_plot
shape: (667, 2)
First_Word Text
str str
"This" "This is a lot mehnnnn....."
"Great" "Great content, I hope you keep…
"How" "How can I get a remote data an…
"Can" "Can you recommend any complete…
"This" "This is exactly how I started …
"This" "This is changing fast imo"
"How" "How about qlikview?"
"Thanks" "Thanks"
"Great" "Great stuff !!"
"Thanks" "Thanks"

Both methods produce the same result, and the frequencies can be visualized as follows:

(to_plot
 .group_by('First_Word').len()
 .sort('len', descending=True)
 .hvplot.bar(x='First_Word', y='len',
             xlabel='', ylabel='Count', height=400,
             title='Frequency of select words used as the first word in comments')
 .opts(fontsize={'xticks': 11, 'yticks': 11})             
 )

The main advantage of is_in is that it keeps the code compact and easy to read. This becomes especially valuable when building more complex categorizations inside a when-then-otherwise expression, as sketched below.
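
A minimal sketch of that combination, using the comments_1st_word_df dataframe from above (the category labels Praise, Question, and Other are our own invention):

(comments_1st_word_df
 .with_columns(
     Category=pl.when(pl.col('First_Word').is_in(['Thanks','Great']))
         .then(pl.lit('Praise'))
         .when(pl.col('First_Word').is_in(['How','Can','What']))
         .then(pl.lit('Question'))
         .otherwise(pl.lit('Other')))
 )

In the next section, we explore more advanced text handling techniques in Polars, with a focus on regular expressions.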

4.6.3 Regular Expressions

Polars offers several techniques for pattern matching. Among them, regular expressions, often called regex, stand out as extremely powerful but also easy to misuse. I enjoy working with regular expressions, but I try not to rely on them too much. The reason is straightforward. Regex patterns are easy to forget and difficult to maintain over time. The cautionary quote at the start of this chapter is intentional. Before using regex to process text, pause and ask yourself, “Is there an easier way to solve this?”

Regular expressions are strings of characters that describe search rules, many of which carry special meanings. The greatest challenge in learning and applying regex is its cryptic syntax. Regex does not resemble everyday language, nor does it closely mirror programming languages such as Python or JavaScript. However, once you understand the purpose of the special symbols, you can write and interpret regex with confidence. As with the Polars code we’ve been writing, it is best to start with simple expressions, introduce complexity gradually, and validate results step by step. Clear and frequent comments are essential, both for collaborators and for your future self.

Regex functions as a kind of mini-language, but it only exists within other programming environments. You can use it in Python, JavaScript, SQL, and many other tools, but you cannot write standalone programs using regex alone. Although the syntax differs slightly across platforms, the core concepts remain consistent, making it relatively easy to adapt once you understand the fundamentals.

This book does not attempt to provide a complete reference for regex syntax or usage. Instead, it focuses on the essentials needed to solve common text-processing problems in Polars. We begin by examining how Polars identifies regex patterns, then explore the core syntax, and finally work through examples that apply regex to the youtube_comments_clean dataframe.

Note

Polars does not rely on POSIX (Portable Operating System Interface) regex. Instead, it uses the Rust regex crate, which implements a fast style of regular expressions distinct from both Python’s re library and POSIX standards.

Table 4.1 summarizes common metacharacters used in regular expressions.

Table 4.1: Common regex syntax.
Syntax What it does
. Matches any character.
\d Matches any single digit.
\w Matches any word character (equivalent to [A-Za-z0-9_]).
\s Matches any space, tab, or newline.
\ Used to escape the following character when that character is a special character.
^ Is an “anchor” which asserts the position at the start of the line.
$ Is an “anchor” which asserts the position at the end of the line.
\b Asserts that the pattern must match at a word boundary. Putting this on either side of a word stops the regular expression from matching longer variants of the word.

The Polars functions that accept regular expressions include replace, replace_all, contains, and extract. To use regex in Polars, we write the pattern as a raw string by prefixing it with the letter r, which prevents Python from interpreting the backslashes before the pattern reaches the regex engine. If extracting or reusing parts of the match, enclose the pattern in parentheses. For example, we can use the first pattern shown in Table 4.1 to filter rows in Text that match the pattern. Since the pattern matches any character, this operation returns all rows.

(youtube_comments_clean
 .select('Text')
 .filter(pl.col('Text').str.contains(r'(.)'))
 )
shape: (4_839, 1)
Text
str
"It took me 7 years of working …
"1:50 Fiat Multipla is indeed t…
"Bro has beautiful eyes for no …
"Man, I did it. I actually list…
"I think I clicked on this vide…
"Like with women, the numbers g…
"I can answer that if you want …
" @vvs1998 Tableau Public is fr…
"​Thanks ​ @vvs1998 for your ho…
"Thanks"

Alternatively, we can return only rows that do not start with either a number or a letter:

(youtube_comments_clean
 .select('Text')
 .filter(~pl.col('Text').str.contains(r'(^\w)'))
 )
shape: (564, 1)
Text
str
"🚀 Revolutionize Your Data Mana…
" @Stefanovic92  thanks bro!"
" @Stefanovic92  thank you😊🙏"
" @Stefanovic92  ya i wanted to…
"🇧🇷"
"😂😂😂"
"@Maubinyaachi"
" @govinddangi9655 it’s really …
" @vvs1998 Tableau Public is fr…
"​Thanks ​ @vvs1998 for your ho…

Another important set of symbols is [ and ], known as square brackets. When used together, they define a character group in which any single character listed inside the brackets can be matched. Even if multiple characters appear inside the brackets, the pattern still corresponds to only one character at a time. Later, we will explore how to repeat matches. A common use of square brackets is handling case sensitivity by including both uppercase and lowercase versions of a letter. Avoid using commas, since a comma would be treated as a literal character to match.

(youtube_comments_clean
 .select(Original=pl.lit('The data is about YouTube comments.'))
 .filter(pl.col('Original').str.contains(r'([Tt]he)'))
 )
shape: (1, 1)
Original
str
"The data is about YouTube comm…

In this example, the pattern matches either “the” or “The”. Because the example string starts with one of these values, the filter returns the row. This behavior differs from using the case-insensitive flag (?i), since variations such as “tHe” and “THE” do not match this pattern:

(youtube_comments_clean
 .select(Original=pl.lit('tHe data is about YouTube comments.'))
 .filter(pl.col('Original').str.contains(r'([Tt]he)'))
 )
shape: (0, 1)
Original
str

Square brackets are also useful for constructing patterns that involve digits, which allows us to match numeric values. Suppose we want to capture any Time entry that refers to “2 months,” “3 months,” or “4 months.” While this could be implemented using a when-then-otherwise expression with multiple contains checks, regex offers a shorter and more readable solution:

(youtube_comments_clean
 .select('Time')
 .filter(pl.col('Time').str.contains(r'([234] month)'))
 )
shape: (250, 1)
Time
str
"4 months ago"
"2 months ago"
"2 months ago"
"4 months ago"
"4 months ago"
"2 months ago"
"3 months ago"
"2 months ago"
"3 months ago"
"3 months ago"

To match any digit, we could list all numbers inside square brackets:

[0123456789]

Fortunately, regex also supports defining ranges of characters using a dash (-). For example, [0-9] represents all digits. We can further restrict the range, such as [0-3] or [4-9]. Using a range is equivalent to writing out each digit individually:

(youtube_comments_clean
 .select('Time')
 .filter(pl.col('Time').str.contains(r'([2-4] month)'))
 )
shape: (250, 1)
Time
str
"4 months ago"
"2 months ago"
"2 months ago"
"4 months ago"
"4 months ago"
"2 months ago"
"3 months ago"
"2 months ago"
"3 months ago"
"3 months ago"

Letter ranges are defined in the same way as numeric ranges. Table 4.2 summarizes the range patterns most commonly used in Polars. Square brackets can also include symbols or other non-alphanumeric characters, such as [@%$]:

Table 4.2: Regex range patterns.
Range pattern Purpose
[0-9] Match any number.
[a-z] Match any lowercase letter.
[A-Z] Match any uppercase letter.
[A-Za-z0-9] Match any lowercase or uppercase letter, or any number.
[A-z] Match any character in the ASCII range from A to z; best avoided because the range also includes the symbols between Z and a, so it matches more than letters.

When a pattern needs to capture multiple occurrences of the same type of character, we can place several ranges back to back. For example, a three-digit number can be matched by repeating the digit range three times:

(youtube_comments_clean
 .select(Original=pl.lit("I've clocked 126 days as a Data Analyst."))
 .filter(pl.col('Original').str.contains(r'([0-9][0-9][0-9] days)'))
 )
shape: (1, 1)
Original
str
"I've clocked 126 days as a Dat…

We can also use repetition operators to specify how many times a pattern may occur. This approach is especially useful when the exact number of repetitions is unknown. However, it is important to inspect the results carefully to ensure that the pattern does not capture more matches than intended. To require one or more occurrences, append the + symbol to the pattern:

(youtube_comments_clean
 .select(Original=pl.lit("I've clocked 126 days as a Data Analyst."))
 .filter(pl.col('Original').str.contains(r'([0-9]+ days)'))
 )
shape: (1, 1)
Original
str
"I've clocked 126 days as a Dat…

Table 4.3 summarizes the available options for controlling how often a character set is repeated.

Table 4.3: Regex patterns for matching a character set multiple times.
Symbol Purpose
+ Match one or more instances of the character set.
* Match zero or more instances of the character set.
? Match zero or one instance of the character set.
{ } Match the character set the specified number of times; for example, {3} matches exactly three
{ , } Match the character set between a lower and upper bound of repetitions, such as {3,5}
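
The bounded forms have not appeared in our examples yet, so here is a quick sketch on the same literal string:

(pl.select(Text=pl.lit("I've clocked 126 days as a Data Analyst."))
 .with_columns(Exactly_Three=pl.col('Text').str.contains(r'([0-9]{3} days)'),
               Two_To_Four=pl.col('Text').str.contains(r'([0-9]{2,4} days)'))
 )
# Both new columns return true, since "126" contains exactly three digits.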

Sometimes a pattern must include a character that normally has a special meaning in regex. In these cases, we need to indicate that the character should be treated literally rather than as an operator. This is done using an escape character, which in regex is the backslash (\). For example, we can return the row that contains a question mark by escaping it explicitly:

sample_text = [
    '"Should I learn Tableau?" He asked',
    'The course fee was under ^72.',
]

(youtube_comments_clean
 .select(Original=pl.lit(sample_text).explode())
1 .filter(pl.col('Original').str.contains(r'(\?)'))
 )
1
Omitting the backslash \ causes an error.
shape: (1, 1)
Original
str
""Should I learn Tableau?" He a…

The code below does not escape the caret ^, so Polars interprets it as a regex operator rather than a literal character. As a result, the expression looks for rows that start with a digit and returns an empty dataframe.

(youtube_comments_clean
 .select(Original=pl.lit(sample_text).explode())
 .filter(pl.col('Original').str.contains(r'(^[0-9]+)'))
 )
shape: (0, 1)
Original
str

By inserting a backslash immediately before ^, the pattern treats it as a literal character and returns the expected row.
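
A sketch of the corrected pattern:

(youtube_comments_clean
 .select(Original=pl.lit(sample_text).explode())
 .filter(pl.col('Original').str.contains(r'(\^[0-9]+)'))
 )
# returns the row "The course fee was under ^72."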

Text data often includes whitespace characters. These range from the visible space character to less obvious symbols such as tabs and line breaks, which may not always be displayed. We will later explore how to replace these characters using regular expressions. For now, the focus is on how to match them. A tab is represented by \t. Line breaks can be matched with \r for a carriage return or \n for a line feed. In some systems, both are required together as \r\n. Try running the code on your machine to determine which combination produces the expected result. To match any type of whitespace, you can use \s. Keep in mind that this pattern also includes the standard space character.

We will create a list containing four sentences and assign one sentence to each of four Polars columns named Comparison_1 through Comparison_4. We will then check whether the text in each column contains whitespace or newline characters.

sample_text = [
'''filtering
pivoting
and aggregating''',

'''filtering
pivoting
and aggregating''',

'''filtering pivoting''',

'''aggregating'''
]

As a reminder, we can retrieve the first sentence in sample_text with the following expression:

sample_text[0]
'filtering\npivoting\nand aggregating'

Next, we will create a Polars dataframe with columns named Comparison_1 through Comparison_4. Each column will indicate whether a specific character appears in the text. If the character is present, the value in that column will be true.

(youtube_comments_clean
 .head(1)
 .select(Comparison_1=pl.lit(sample_text[0]).str.contains(r'\n'),
         Comparison_2=pl.lit(sample_text[1]).str.contains(r'\s'),
         Comparison_3=pl.lit(sample_text[2]).str.contains(r'\s'),
         Comparison_4=pl.lit(sample_text[3]).str.contains(r'\s'))
 )
shape: (1, 4)
Comparison_1 Comparison_2 Comparison_3 Comparison_4
bool bool bool bool
true true true false

Much like in mathematics, parentheses in regular expressions allow us to group parts of an expression so they are treated as a single unit. This becomes especially useful when you want to match a more complex sequence that must occur multiple times.

sample_text = [
'valid codes follow the pattern 12a34b56c',
'the first code entered was 123a456c',
'the second code entered was 99x66y33z',
]

(youtube_comments_clean
 .head(1)
 .select(Comparison_1=pl.lit(sample_text[0]).str.contains(r'([0-9]{2}[a-z]){3}'),
         Comparison_2=pl.lit(sample_text[1]).str.contains(r'([0-9]{2}[a-z]){3}'),
         Comparison_3=pl.lit(sample_text[2]).str.contains(r'([0-9]{2}[a-z]){3}'))
 )
shape: (1, 3)
Comparison_1 Comparison_2 Comparison_3
bool bool bool
true false true

All three lines use the same regular expression: ([0-9]{2}[a-z]){3}. The portion inside the parentheses, [0-9]{2}[a-z], describes two digits followed by a lowercase letter. The {3} that follows requires this sequence to appear three times consecutively. The first example satisfies this rule exactly, since the string 12a34b56c matches the pattern. The second example does not: 123a456c contains one valid sequence, 23a, but the next repetition would be 45 followed by 6, and 6 is a digit rather than a letter, so three consecutive repetitions never occur. The third example, 99x66y33z, matches the expression in full.

As we have seen, regular expressions can be combined in many ways with other expressions, whether regex patterns or plain text, to form powerful matching rules. In addition to defining what to match, regex can also specify where a match should occur. The special character \b anchors a match to the start or end of a word. For example, suppose we want to locate the word “python” in YouTube comments. We might begin with an expression like this:

sample_text = [
'I love learning python for data analysis',
'Writing pythonic code requires expertise',
'I find pythonbytes.fm very informative',
"I'm a pythonista who codes every day",
]

(youtube_comments_clean
 .head(1)
 .select(Comparison=pl.lit(sample_text[0]).str.contains(r'python'))
 )
shape: (1, 1)
Comparison
bool
true

This code finds “python” in the string and returns true, as expected. Now consider testing a few additional comments using the same expression:

(youtube_comments_clean
 .head(1)
 .select(Comparison_1=pl.lit(sample_text[0]).str.contains(r'python'),
         Comparison_2=pl.lit(sample_text[1]).str.contains(r'python'),
         Comparison_3=pl.lit(sample_text[2]).str.contains(r'python'),
         Comparison_4=pl.lit(sample_text[3]).str.contains(r'python'),
         )
 )
shape: (1, 4)
Comparison_1 Comparison_2 Comparison_3 Comparison_4
bool bool bool bool
true true true true

All of these strings match “python,” even though “pythonista” is not what we intended to capture when searching for references to the programming language. To address this, we can add \b to both the beginning and end of the “python” pattern:

(youtube_comments_clean
 .head(1)
 .select(Comparison_1=pl.lit(sample_text[0]).str.contains(r'\bpython\b'),
         Comparison_2=pl.lit(sample_text[1]).str.contains(r'\bpython\b'),
         Comparison_3=pl.lit(sample_text[2]).str.contains(r'\bpython\b'),
         Comparison_4=pl.lit(sample_text[3]).str.contains(r'\bpython\b'),
         )
 )
shape: (1, 4)
Comparison_1 Comparison_2 Comparison_3 Comparison_4
bool bool bool bool
true false false false

With the boundary markers in place, only the standalone word “python” is matched, and longer words that contain it are excluded. While adding spaces before and after the word could produce a similar effect, \b has the advantage of also working when the word appears at the very beginning of a string:

other_text = [
    'python tutorials are the best on YouTube',
    'I love learning python for data analysis',
    'My favorite tool is python',
    'The debate over R vs python is over',
]

(youtube_comments_clean
 .head(1)
 .select(Comparison_1=pl.lit(other_text[0]).str.contains(r'\bpython\b'),
         Comparison_2=pl.lit(other_text[0]).str.contains(r' python '))
 )
shape: (1, 2)
Comparison_1 Comparison_2
bool bool
true false

In this example, '\bpython\b' correctly matches “python” at the start of the string, while ' python ' does not. To anchor a match to the start of an entire string, use the ^ character. To anchor a match to the end of a string, use $:

(youtube_comments_clean
 .head(1)
 .select(Comparison_1=pl.lit(other_text[0]).str.contains(r'^python\b'),
         Comparison_2=pl.lit(other_text[1]).str.contains(r'^python\b'),
         Comparison_3=pl.lit(other_text[2]).str.contains(r'\bpython$'),
         Comparison_4=pl.lit(other_text[3]).str.contains(r'\bpython$'),
         )
 )
shape: (1, 4)
Comparison_1 Comparison_2 Comparison_3 Comparison_4
bool bool bool bool
true false true false
  • In the first line, the pattern matches “python” at the start of the string.
  • The second line begins with “I,” so the match fails.
  • The third line ends with “python,” so the match succeeds.
  • The fourth line ends with “over,” so no match is found.

If you are new to regular expressions, it may take some experimentation to become comfortable with them. Practicing with real-world datasets is one of the most effective ways to learn. In the next section, we will look at two other Polars functions that accept regex: extract and replace.

4.6.4 Finding and Replacing with Regex

In the previous section, we discussed regular expressions and how to construct regex patterns to match parts of strings in a dataset. We will now apply these techniques to the YouTube comments dataset to see how they work in practice.

The Author column in the youtube_comments_clean dataframe contains the user names of viewers who commented on the videos. We can use extract to identify user names that end in one or more digits and return only the nonnumeric portion of the string. When the pattern does not find a match, Polars returns a null value.

(youtube_comments_clean
 .select('Author')
1 .with_columns(Extracted=pl.col('Author').str.extract(r'([a-zA-Z]+)\d+\b'))
 )
1
The regex pattern ([a-zA-Z]+)\d+\b captures one or more letters followed by digits at a word boundary and returns the alphabetic portion.
shape: (4_839, 2)
Author Extracted
str str
"@Stefanovic92" "Stefanovic"
"@sasker1000" "sasker"
"@cpblair" null
"@cometamítico" null
"@Hervoicehealed" null
"@kknight1990" "kknight"
"@vvs1998" "vvs"
"@Nikopup" null
"@adedapomichaelowoeye9353" "adedapomichaelowoeye"
"@Esss8499" "Esss"

Next, we can count the number of mentions for each of the four learning types: bootcamp, university, college, and online course. This cannot be done with group_by because the learning types appear only as substrings within the longer comments stored in the Text column.

to_plot = (youtube_comments_clean
 .select('Text')
 .with_columns(Bootcamp=pl.col('Text').str.extract(r'(?i)\bbootcamp\b', 0),
               University=pl.col('Text').str.extract(r'(?i)\buniversity\b', 0),
               College=pl.col('Text').str.extract(r'(?i)\bcollege\b', 0),
              Online_Course=pl.col('Text').str.extract(r'(?i)\bonline course\b', 0)
               )
 .drop('Text')
1 .with_columns(pl.all().count())
 .limit(1)
 )
to_plot
1
Unlike len, which counts all rows including nulls, count only counts rows that contain a value.
shape: (1, 4)
Bootcamp University College Online_Course
u32 u32 u32 u32
46 47 67 14
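
To make the difference between len and count concrete, here is a minimal sketch on a throwaway column containing a null:

(pl.DataFrame({'Mention': ['bootcamp', None, 'college']})
 .select(Len=pl.col('Mention').len(),
         Count=pl.col('Mention').count())
 )
shape: (1, 2)
Len Count
u32 u32
3 2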

We can visualize the learning-type counts using a bar chart.

(to_plot
 .rename(dict(Online_Course='Online Course'))
 .unpivot(variable_name='Learn_Type', value_name='Count')
 .sort('Count', descending=True)
 .hvplot.bar(x='Learn_Type', y='Count',
             xlabel='', ylabel='',
             title='Frequency of learning types mentioned')
 )

The chart shows that college is mentioned most frequently, followed by university, though these categories may overlap conceptually. More notably, bootcamp mentions exceed those of online courses.

In addition to finding matches, we may also want to replace matched text with alternative values. This is especially useful when cleaning text that contains multiple spellings or variations of the same concept. The replace and replace_all functions support this use case and can accept regular expressions as patterns.

(youtube_comments_clean
 .select(Original=pl.lit('crust rust code'))
 .with_columns(Normal_Replace=pl.col('Original').str.replace_all('rust','python'),
               Regex_Replace=pl.col('Original').str.replace_all(r'\brust\b','python'))
 )
shape: (1, 3)
Original Normal_Replace Regex_Replace
str str str
"crust rust code" "cpython python code" "crust python code"

Replacing text with regex is more precise because it allows us to target specific patterns within a string. In this example, the word boundaries in Regex_Replace ensure that only the standalone word “rust” is replaced, while the substring in “crust” is left unchanged.
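
Keep in mind that replace substitutes only the first match, while replace_all substitutes every match. A quick sketch:

(youtube_comments_clean
 .select(Original=pl.lit('rust rust rust'))
 .with_columns(Replace_First=pl.col('Original').str.replace('rust', 'python'),
               Replace_All=pl.col('Original').str.replace_all('rust', 'python'))
 )
shape: (1, 3)
Original Replace_First Replace_All
str str str
"rust rust rust" "python rust rust" "python python python"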

In this section, we explored several techniques for locating and modifying elements within larger blocks of text. These ranged from simple checks such as contains, to membership tests using is_in, and more advanced regex-based searches. Combined with the parsing and transformation tools discussed earlier, these methods make it possible to design tailored rule sets for handling complex datasets. That said, it is important to balance the benefits of intricate rules against the cost of maintaining them. For one-time exploratory analyses, detailed cleaning rules can be worthwhile. For ongoing reporting or monitoring, it is often more efficient to ensure that upstream data sources provide cleaner inputs. In the next section, we will examine how to generate new text strings in Polars using constants, existing strings, and parsed values.

4.7 Constructing and Reshaping Text

So far, we have focused on how Polars parses, transforms, finds, and replaces elements within strings to support data cleaning and analysis. Beyond these operations, Polars also offers tools for constructing new text outputs. One core technique is concatenation, which combines values from different columns or data types into a single column. We will also look at how to reshape text by combining multiple columns into one row, as well as the reverse operation of splitting a single string into multiple rows.

4.7.1 Concatenation

Polars allows us to construct new strings through concatenation. We can combine constants, literal text, column values, or computed results into a single output. Several approaches are available, including the concat_str function, the add method, and the + operator.

The following example combines values in Title_1 and Title_2 into a single column called Concatenate.

(youtube_comments_clean
 .select(Title_1=pl.lit('Deep Analysis'),
         Title_2=pl.lit('with Polars'))
 .with_columns(Concatenate=pl.concat_str(pl.col('Title_1'), pl.col('Title_2'), separator=' '))
 )
shape: (1, 3)
Title_1 Title_2 Concatenate
str str str
"Deep Analysis" "with Polars" "Deep Analysis with Polars"

We can achieve the same result using the add method or the + operator.

(youtube_comments_clean
 .select(Title_1=pl.lit('Deep Analysis'),
         Title_2=pl.lit('with Polars'))
 .with_columns(Add=pl.col('Title_1').add(' ').add(pl.col('Title_2')),
               Operator=pl.col('Title_1') + ' ' + pl.col('Title_2'))
 )
shape: (1, 4)
Title_1 Title_2 Add Operator
str str str str
"Deep Analysis" "with Polars" "Deep Analysis with Polars" "Deep Analysis with Polars"

Concatenation is also useful when working with the YouTube comments dataset. In the next example, we select the five longest comments and create a dataframe that displays the author, the character count of the comment, and when it was posted.

1 with pl.Config(fmt_str_lengths=100):
    display(youtube_comments_clean
     .with_columns(Char_Count=pl.col('Text').str.len_chars())
     .top_k(5, by='Char_Count')
     .with_columns(pl.concat_str(pl.lit('By: '), pl.col('Author'),
                                 pl.lit(', characters: '), pl.col('Char_Count'),
                                 pl.lit(', when: '), pl.col('Time'))
                   .alias('Concatenation')
                   )
     .select('Concatenation')
     )
1
This configuration raises the string display limit to 100 characters so that each concatenated value is shown in full.
shape: (5, 1)
Concatenation
str
"By: @amdenis, characters: 8251, when: 5 years ago"
"By: @grantmurray6185, characters: 3616, when: 5 years ago"
"By: @patrickjane276, characters: 2858, when: 3 years ago"
"By: @mwredfern, characters: 2858, when: 5 years ago"
"By: @mwredfern, characters: 2555, when: 2 years ago"

Alongside functions and operators for building new text through concatenation, Polars also provides specialized functions for reshaping text, which we will examine next.

4.7.2 Reshaping Text

As discussed in Chapter 2, changing the shape of data is sometimes necessary. This may involve pivoting from rows to columns or the reverse, effectively transforming columns into rows. We previously used the pivot and unpivot methods for this purpose. Polars also includes specialized functions designed specifically for reshaping text.

For example, we can use the str.join expression to create a dataframe that shows authors alongside the different technologies they mention:

(youtube_comments_clean
 .select('Text','Author')
 .with_columns(pl.col('Text').str.replace_all(r'(?i)power bi', 'powerbi'))
 .with_columns(Powerbi=pl.col('Text').str.extract(r'(?i)\bpowerbi\b', 0),
               Tableau=pl.col('Text').str.extract(r'(?i)\btableau\b', 0),
               Python=pl.col('Text').str.extract(r'(?i)\bpython\b', 0),
               SQL=pl.col('Text').str.extract(r'(?i)\bsql\b', 0),
               Excel=pl.col('Text').str.extract(r'(?i)\bexcel\b', 0))
 .with_columns(pl.exclude('Author','Text').str.to_titlecase())
1 .with_columns(Technology=pl.concat_list(pl.exclude('Text','Author')).list.unique().list.sort())
 .select('Author','Technology')
 .explode('Technology')
 .unique()
 .group_by('Author')
 .agg(Technologies=pl.col('Technology').sort().str.join(', '))
 .filter(pl.col('Technologies') != '')
 )
1
Sorting within each list (with list.sort) gives every author's technologies a consistent order, so that variations such as ['Python', 'SQL'] and ['SQL', 'Python'] are represented by the same value.
shape: (322, 2)
Author Technologies
str str
"@tenou213" "Sql"
"@mironadiel" "Python, Sql"
"@joshuagithui9355" "Excel, Powerbi, Tableau"
"@MadeForD" "Excel, Python"
"@MrProtikiub" "Sql"
"@EducationGapsNoMore" "Sql"
"@yogamayaa" "Python"
"@BigUche" "Tableau"
"@LeolimaGalo" "Python"
"@vishalmane3139" "Excel, Python, Sql, Tableau"

This result shows each author and the technologies they mention as a single string, with multiple technologies separated by commas. In effect, we collapsed multiple rows per author into one. If an author mentioned three technologies, each initially appeared in its own row, but str.join combined them into a single comma-separated string.
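
Stripped of the extraction logic, the collapsing step amounts to this minimal sketch on a small hypothetical dataframe:

(pl.DataFrame({'Author': ['@a', '@a', '@b'],
               'Technology': ['Python', 'Sql', 'Excel']})
 .group_by('Author', maintain_order=True)
 .agg(Technologies=pl.col('Technology').str.join(', '))
 )
shape: (2, 2)
Author Technologies
str str
"@a" "Python, Sql"
"@b" "Excel"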

A related use case applies the same ideas in reverse. Instead of creating a single string from multiple rows, we can create multiple rows from a single string. Consider the first comment in youtube_comments_clean. We can build a dataframe that contains one row for each word in that comment.

(youtube_comments_clean
 .head(1)
 .select(pl.col('Text').str.split(' '))
 .explode('Text')
 )
shape: (61, 1)
Text
str
"It"
"took"
"me"
"7"
"years"
"creating"
"a"
"life"
"of"
"freedom"

The explode method does not operate on the String data type, which is why we first converted Text into a List. Text can be split on any character, or even on an entire word, not just on whitespace:

(youtube_comments_clean
 .select(Original=pl.lit('Python, SQL, Tableau, Excel'))
 .with_columns(pl.col('Original').str.split(','))
 )
shape: (1, 1)
Original
list[str]
["Python", " SQL", … " Excel"]

The opposite of explode is the implode expression, which collapses multiple rows of a column into a single row:

(youtube_comments_clean
 .select(Original=pl.lit('Python, SQL, Tableau, Excel'))
 .with_columns(pl.col('Original').str.split(','))
 .explode('Original')
 .select(pl.col('Original').implode())
 )
shape: (1, 1)
Original
list[str]
["Python", " SQL", … " Excel"]

Using explode in this way also allows us to identify the most common words in a text column. As an example, let us examine the most frequently used words in the comments dataset:

(youtube_comments_clean
 .select('Text')
 .with_columns(Word=pl.col('Text')
1               .str.replace_all(r'\s+', ' ')
               .str.to_lowercase()
               .str.split(' '))
 .explode('Word')
2 .with_columns(pl.col('Word').str.replace_all(r'[^A-Za-z]',''))
 .filter(pl.col('Word') != '')
 .group_by('Word')
 .agg(Frequency=pl.len())
 .sort('Frequency', descending=True)
)
1
Replaces any run of whitespace, including tabs and newlines, with a single space.
2
Removes any nonletter characters, such as punctuation and digits, attached to a word.
shape: (7_527, 2)
Word Frequency
str u32
"i" 4299
"to" 3429
"a" 2837
"the" 2783
"you" 2665
"paddy" 1
"laboratory" 1
"chage" 1
"capabilites" 1
"thinkful" 1

As expected, the most frequent words are predictable, everyday terms that provide little analytical insight. To produce a more meaningful list, we can remove what are known as stop words: high-frequency terms that appear in nearly all texts.

Libraries such as nltk provide collections of stop words, although these lists vary across packages and are not standardized. There is no universally accepted set of stop words, and it is common to tailor the list to a specific task. Many publicly available lists can be found online. In this example, we use a dataset containing 421 frequently used words, which we load into a dataframe named stop_words. We then remove these words from the earlier results using an anti join:

stop_words = pl.read_parquet('data/stop_words.parquet')

to_plot = (youtube_comments_clean
 .select('Text')
 .with_columns(Word=pl.col('Text')
               .str.replace_all(r'\s+', ' ')
               .str.to_lowercase()
               .str.split(' '))
 .explode('Word')
 .with_columns(pl.col('Word').str.replace_all(r'[^A-Za-z]',''))
 .filter(pl.col('Word') != '')
 .group_by('Word')
 .agg(Frequency=pl.len())
 .join(stop_words, left_on='Word', right_on='Stop_Words', how='anti')
 .sort('Frequency', descending=True)
)
to_plot
shape: (7_144, 2)
Word Frequency
str u32
"data" 1841
"im" 736
"science" 704
"thanks" 664
"video" 611
"shitty" 1
"mother" 1
"chia" 1
"forum" 1
"binary" 1

The top ten most common words after removing stop words are shown in the chart below:

(to_plot
 .head(10)
 .reverse()
 .hvplot.barh(x='Word', y='Frequency',
             xlabel='', height=450)
 .opts(fontsize={'xticks': 11, 'yticks': 11},
       title='Top 10 most common words in comments\n(without stop words)')
 )

Depending on the dataset, you may want to extend the stop-word list further to remove additional terms that do not add meaningful insight to the analysis.
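
For example, the earlier results still contain terms such as “im” and “thanks” that add little insight. A minimal sketch of one way to extend the list before rerunning the anti join (the extra words here are illustrative):

extra_stop_words = pl.DataFrame({'Stop_Words': ['im', 'ive', 'thanks', 'video']})
stop_words_extended = pl.concat([stop_words, extra_stop_words]).unique()

Rerunning the earlier pipeline with stop_words_extended in place of stop_words would then filter out these terms as well.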

Constructing and reshaping text with Polars can be as simple or as complex as needed. Concatenation, string joining, and string-splitting functions can be used independently or combined with one another, as well as with other Polars expressions and operators, to produce the desired output.

4.8 Conclusion

Although Polars is not usually the first library people consider for text-focused tasks, it offers a broad set of powerful tools for working with textual data. Whether the goal is to parse strings, transform them, perform search and replace operations, or restructure text, Polars provides the functionality needed to clean, prepare, and analyze text effectively.

In the next chapter, we turn to anomaly detection, another area where Polars is often overlooked but delivers capabilities that exceed many expectations.